1 ) Introduction

The following Tutorial is the final assessment of the project seminar “Species Distribution Modeling” at Philipps-University Marburg. In this tutorial we’re going to use the XGBoost algorithm to predict the specie´s distribution of butterflies in Pakistan and create a species richness map of the country. XGBoost (eXtreme Gradient Boosting) is a popular machine learning algorithm that belongs to the family of gradient boosting methods. It was developed by Tianqi Chen. and uses a combination of gradient boosting, decision trees, regularization, gradient-based optimization, feature importance analysis, parallelization. All this make’s it a robust and powerful algorithm that often delivers state-of-the-art results in various machine learning tasks. You will be introduced to the basic concepts of XGBoost and you’ll be provided with a reproducible workflow to use XGBoost to build classification models.

2 ) But what’s XGBoost?

XGBoost is a ensemble Method same as Random Forrest, this means it combines the output of multiple Trees. But the methods differ in the way the idividual Trees are build and how the results are combined. In Xgboost the Output oft the Trees aren’t combined equally. Instead XGBoost uses a method called boosting. Boosting combines weak learner (small trees) sequentually so that the new tree corrects the errors oft he previous one. To understand this we have to look into some mathematical details. But dont worry when using XGBoost these details will be automated. Nevertheless its importand to understand these processes to optimize the algorithm later on.

3 ) So how does it exactly work?

As said XGBoost builds on many concepts to deliver it’s often outstanding results. We’re going to start with the mathematical base concepts of how XGBoost builds trees.

In this assessment were trying to classify geo-points if they’re potential habitats for a number of butterfly species. In order to do that we need XGBoost to build classification trees. XGBoost work’s also with regression but the process differs a little, so were not going to focus on that.


When building classification trees XGBoost evaluates each given point if a condition is met or not. In our case the green dots, with a value of one, represent the presence points. Red dots, with values of zero represent absence points. The black line in the middle is XGBoost’s initial prediction. By default, it is 0.5, which means there is a 50% chance to find a butterfly at any given point.
In the next step XGBoost calculates the residuals of all given points. The residuals are the difference between the observed prediction and the predictet. If the observed value is one (i.e., a presence point), the residuals are 0.5. The same applies to values of zero (i.e., absence points) where the residuals are -0.5.
In order to find the threshhold of parameters that influence the probability of a point being a presence or absence point the data has to be split at the crucial values of that parameter.
Therefore XGBoost splits the observations at thresholds that result in the highest gain value. But whats the gain-value?
To calculate the gain-value we need the similarity-score of each leaf and the root.
By summing the similarity-scores of the left and the right leaf and then substracting the similarity-score of the root we get the gain-value. In this case it’s 3.8 and the highest possible for this dataset therefore this would be the final tree. But why doesn’t XGBoost split the residuals any further? This has to do with regularization parameters and pruning wich we’re going to explain in the next chapter.

3.1 ) Regularization & pruning

How XGBoost builds trees is limited by multiple regularization parameters:

3.1.1 ) Lambda

We’ve heard of Lambda when we’re calculated the similarity-score. XGBoost default value for Lambda is 0 therefore we’ve been ignoring it. But when Lambda is set to >0 the similarity-score get’s smaller because the denominator becomes larger. Thus Lambda prevents over-fitting.

3.1.2 ) Cover/min_child_weigth

Another regularization parameter is the Cover or min_child_weight. This parameter is also the reason why we haven’t continued building our example tree. In XGBoost the default value for the cover is 1 wich means that every leaf with a cover value less than 1 doesn’t get build. The cover for classification tree’s is calculated by summing the previous probability times 1 minus the previous probability, for each residual in the leaf.

3.1.3 ) Gamma

Similar to Cover (min_child_weigth) Gamma is a regularization parameter that causes XGBoost to only build new leafs when the Gain-Value is larger than Gamma.

Therefore it prevents overfitting the trees to our data. Gamma is a highly specialized regularization parameter, what mean’s that there is no “good” value. By default it’s 0 therefore no regularization takes place.

XGBoost substract’s gamma from the Gain-Value and then removes the leaf if the result is a negative number. For example if we take the previous calculated Gain-Value of our example tree of 3.8 a gamma-value of 4 would prune the whole tree down to the root. But if the gamma-value is just 0 XGBoost can build extremly large trees thus overfitting the trees to the dataset and raising the computation time a lot.

3.1.4 ) Learning rate

3.2 ) Evaluation Metrics

3.2.1 ) Logloss

3.2.2 ) Importance / Shap values

4 ) Application of XGBoost in R

XGBoost is available for different programming and scripting languages, include Python, R, Java and C++. Docuementation is available here: https://xgboost.readthedocs.io/en/stable/

4.1 ) Prerequisites

Before you dive into the code you need to install some packages this script will use:

install.packages("terra")
install.packages("ggplot2")
install.packages("fastDummies")
install.packages("tidyterra")

The XGBoost package can by installed in two different ways: * First there is the default package from CRAN, which will do it in most situations.

install.packages("xgboost")
  • But if you are dealing with large data sets you may want to use GPU acceleration. Therefor you have to use a prebuild package from GITHUB (https://github.com/dmlc/xgboost/releases). Download it, place it in the same folder as this script and run the commands below.

TODO: For Benchmarks of a High-End CPU vs Low-End GPU, see https://medium.com/data-design/xgboost-gpu-performance-on-low-end-gpu-vs-high-end-cpu-a7bc5fcd425b. Risks? Reproducability?

##
## !! Installation on windows failed with "Warning in system("sh ./configure.win") 'sh' not found", for dirty Solution see section Troubleshooting
##

# Install dependencies
install.packages(c("data.table", "jsonlite"))

# Install XGBoost
system(paste("R CMD INSTALL ", getwd(),  "/xgboost_r_gpu_win64_21d95f3d8f23873a76f8afaad0fee5fa3e00eafe.tar.gz", sep=""))

Don’t forget to enable CPU acceleration in the knitting paramters.

After installing all needed libaries you need to load them:

require(dplyr)      # easy dataframe modification
require(ggplot2)    # plotting

require(geodata)    # downloading geospatial world dataset made easy
require(sf)         # simple geospatial features
require(terra)      # raster manipulation

require(tidyterra)  # plot terra objects with ggplot

require(fastDummies)# create binary factor columns from character column
require(xgboost)    # our modeling libary
set.seed("101")

4.2 ) Data preparation

Let’s begin with preparing the data used to train the model. Start with getting a overview of the provided data:

species_occurrences_all <- read.table("data/PakistanLadakh.csv", sep=",", header=TRUE)
species_occurrences_all <- sf::st_as_sf(species_occurrences_all, coords=c("x", "y"), remove=TRUE, crs=sf::st_crs("epsg:4326"))

str(species_occurrences_all)
## Classes 'sf' and 'data.frame':   5870 obs. of  2 variables:
##  $ species : chr  "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" ...
##  $ geometry:sfc_POINT of length 5870; first list element:  'XY' num  73.1 34.4
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA
##   ..- attr(*, "names")= chr "species"
ggplot() + 
  geom_sf(data = sf::st_as_sf(species_occurrences_all), mapping=aes(color=species), show.legend = FALSE) +
  ggtitle("Oberserved occurrence of butterflies in Pakistan")

Next we need some environmental data to train the model. Therefor we selected the bioclim data, which are widely used in speceis distribution modeling (Source: https://isprs-archives.copernicus.org/articles/XLII-4-W19/449/2019/isprs-archives-XLII-4-W19-449-2019.pdf#:~:text=The%20earliest%20studies%20of%20SDM%20used%20BIOCLIM%20-,requires%20species%20occurrence%20data%20%28latitude%2C%20longitude%2C%20and%20elevation%29.). The Bioclim layers are missing elevation data, we will use those too, since temperature is dependent on elevation. The Border of Pakistan is also needed, we will crop our data with that, so the model doesn’t train areas we don’t not have presence points. The elevation data is alread cropped so we don’t need to repeat this.

Read more here https://www.worldclim.com/bioclim

# political border of pakistan
border_pak <- geodata::gadm(country='PAK', level = 0, path='./data')
ggplot() +   
  geom_sf(data = sf::st_as_sf(border_pak), fill=NA)

# bioclim data from pakistan
bioclim_pak <- geodata::worldclim_country(country = "PAK", res = 10, var = "bio", path = "data/", version = "2.1")
names(bioclim_pak) <- substr(names(bioclim_pak), 11, 20)


nam <- c("Annual Mean Temperature",
  "Mean Diurnal Range",
  "Isothermality",
  "Temperature Seasonality",
  "Max Temperature of Warmest Month",
  "Min Temperature of Coldest Month",
  "Temperature Annual Range",
  "Mean Temperature of Wettest Quarter",
  "Mean Temperature of Driest Quarter",
  "Mean Temperature of Warmest Quarter",
  "Mean Temperature of Coldest Quarter",
  "Annual Precipitation",
  "Precipitation of Wettest Month",
  "Precipitation of Driest Month",
  "Precipitation Seasonality",
  "Precipitation of Wettest Quarter",
  "Precipitation of Driest Quarter",
  "Precipitation of Warmest Quarter",
  "Precipitation of Coldest Quarter"
  )

bioclim_pak <- terra::mask(bioclim_pak, border_pak)



plot(x=bioclim_pak)

# elevation data form pakistan
elevation_pak <- geodata::elevation_30s(country = 'PAK', path = 'data/')
ggplot() + 
  geom_spatraster(data = elevation_pak) +
  geom_sf(data = sf::st_as_sf(border_pak), fill = NA, show.legend = FALSE) +
  scale_fill_hypso_c(name = "Elevation")

So lets define our species_occurrences as presence points :

species_presence <- species_occurrences_all
# adding col presence to determine presence for model training
species_presence <- cbind(species_presence, data.frame(presence = factor(1)))

rm(species_occurrences_all)

As absence points we will user random Points in Pakistan and combine them with the presence points. The absence points will be extended by a column “species”, which matches the column “species”´in the presence points.

# Generate random points inside pakistan as background points and extend them with a column for species = NA
# TODO: why 1000 points?? give a explanation for the decision
border_pak <- sf::st_as_sf(border_pak)
species_absence <- sf::st_sample(border_pak, size = 1000)

# adding col species = NA to the background points, needed for rbind to join the data
# adding col presence to determine absence for model training
species_absence <- cbind(species_absence, data.frame(species = as.character(NA)))
species_absence <- cbind(species_absence, data.frame(presence = factor(0)))
species_absence <- sf::st_as_sf(species_absence)

# Combine presence and absence (background) points into a single object
modeling_data <- rbind(species_presence, species_absence)

# Only points inside Pakistan should be used for modeling, also remove the columns added by the intersection. 
modeling_data <- sf::st_intersection(modeling_data, border_pak) %>% select(-COUNTRY, -GID_0)

Now we got our presence and absence points as spatial data. Finally we will extract values from our environmental data and add those to our modeling data, so xgboost can use this table to train its model.

# Extract values from bioclim and elevation, join them to our modeling_data
extraction_bioclim_pak <- terra::extract(bioclim_pak, modeling_data, bind=FALSE, ID=FALSE)
extraction_elevation_pak <- terra::extract(elevation_pak, modeling_data, bind=FALSE, ID=FALSE)
modeling_data_extracted <- cbind( modeling_data, extraction_bioclim_pak, extraction_elevation_pak)

Clean up of no longer needed variables and check the final modeling data:

# create a final data variable and clean up variables
modeling_data <- modeling_data_extracted

rm(species_occurrences_all); rm(species_presence); rm(species_absence); rm(modeling_data_); rm(extraction_bioclim_pak); rm(extraction_elevation_pak); rm(modeling_data_extracted)
## Warning in rm(species_occurrences_all): Objekt 'species_occurrences_all' nicht
## gefunden
## Warning in rm(modeling_data_): Objekt 'modeling_data_' nicht gefunden
str(modeling_data)

ggplot() + 
  geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
  geom_sf(data = sf::st_as_sf(modeling_data), mapping=aes(color=species), show.legend = FALSE) +
  ggtitle("Oberserved occurrence of butterflies in Pakistan plus background points")

## Classes 'sf' and 'data.frame':   1397 obs. of  23 variables:
##  $ species    : chr  "Acraea_issoria" "Acraea_issoria" "Acraea_issoria" "Acraea_violae" ...
##  $ presence   : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ bio_1      : num  17.7 12.2 15.4 24.2 13.9 ...
##  $ bio_2      : num  12.1 10.4 11.3 14.5 11.6 ...
##  $ bio_3      : num  37.9 36.5 37.7 38.3 38 ...
##  $ bio_4      : num  719 682 682 833 712 ...
##  $ bio_5      : num  33.8 26.1 30.1 41.7 28.8 ...
##  $ bio_6      : num  1.8 -2.5 0 3.9 -1.7 ...
##  $ bio_7      : num  32 28.6 30.1 37.8 30.5 ...
##  $ bio_8      : num  24.7 19.2 22 31.7 21.2 ...
##  $ bio_9      : num  13.98 9.37 12.07 19.13 10.82 ...
##  $ bio_10     : num  26 20.1 23.2 33.1 22.2 ...
##  $ bio_11     : num  8.3 3.37 6.37 13.07 4.53 ...
##  $ bio_12     : num  1358 1376 1005 312 1130 ...
##  $ bio_13     : num  275 239 169 85 186 190 124 86 181 258 ...
##  $ bio_14     : num  28 26 22 3 23 23 22 15 25 17 ...
##  $ bio_15     : num  68.9 57.8 55.9 99.8 54 ...
##  $ bio_16     : num  628 585 413 190 448 460 310 224 423 593 ...
##  $ bio_17     : num  119 133 99 16 113 103 87 59 107 81 ...
##  $ bio_18     : num  619 572 406 187 441 451 146 91 421 563 ...
##  $ bio_19     : num  251 259 195 38 220 205 174 133 208 187 ...
##  $ PAK_elv_msk: num  1219 2315 1581 166 1980 ...
##  $ geometry   :sfc_POINT of length 1397; first list element:  'XY' num  73.1 34.4
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:22] "species" "presence" "bio_1" "bio_2" ...
species_nsamples = data.frame(modeling_data) %>% 
                    count(species, sort=TRUE) %>% 
                    rename(n_samples = n) %>% 
                    filter(!is.na(species))

ggplot(species_nsamples, aes(n_samples)) +
       geom_histogram(binwidth = 4) +
       geom_vline(aes(xintercept=mean(n_samples)), linetype="dashed") +
       annotate(x=mean(species_nsamples$n_samples), y=+Inf, label=paste("Mean:",round(mean(species_nsamples$n_samples),2)), vjust=3, geom="label") +
       labs(x = "Number of Samples", y = "Number of Species")

rm(species_nsamples)

TODO it would be better to buffer the presence points and remove close absence points

4.3 ) Training and prediction with xgboost

Starting with the training of our xgboost model we decided to do sperate training for every model. Despite that xgboost is capable of Multi-Classificiation. Therefor we defined a function ‘train’ wich invokes out data filtering and xgboost specific data preperation to meet the requirements of xgboost. Espacially converting the modeling data into a ‘xgb.DMatrix’ object. Finally we define some general parameters we want to use, e.g. enable of gpu acceleration by knit paramters. Last but not least we save the model to the disk to preserve it for modeling.

https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.core DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.

train <- function(
    data, # training data, required
    nrounds,
    xgb_params = list(), # xgboost params, see https://xgboost.readthedocs.io/en/latest/parameter.html
    sp, # species name, only required if save_model = TRUE
    save_model = TRUE 
) {
  
  # xgboost needs training data in a specific data format, which provides memory and speed optimization
  data <- xgb.DMatrix(
    data = as.matrix(as.data.frame(data) %>% select(-species, -presence, -geometry)), 
    label = as.matrix(as.data.frame(data) %>% select(presence))
  )
  # PARAM gpu_acc:  
  if(params$gpu_acc){ xgb_params = c(xgb_params, tree_method = "gpu_hist") }
  else{ xgb_params = c(xgb_params, tree_method = "hist") }
  xgb_params = c(xgb_params, objective = "binary:logistic") # logistic regression for binary classification, output probability
  xgb_params = c(xgb_params, eval_metric = "auc")
  
  
  model <- xgboost(data = data,
                 nrounds = nrounds,
                 params = xgb_params,
                 verbose = 0
                 )

  message("avg train-auc:", as.numeric(mean(model[["evaluation_log"]][["train_auc"]])))
  
  if(save_model)
  {
    xgb.save(model, paste("out/", sp, ".model", sep = ""))  # Max compatibility with future xgboost versions
    #save(model, file = paste("out/", sp, ".rds", sep = "")) # Fully restorable r object
  }
  
  return(model)
}

Additional training paramters are defined here in a own data frame. Advantage of this are the comparability and useability of the different parameter sets. The Table show the default parameter, one set taken from Roozbeh Valavi et. al. And last the parameters we will user for training. Those have been tested and tuned manually.

Next is to perpare data used to predic species occurrence in pakistan. Therefore we will use the raw raster data and predict of those with ‘terra::predict’ which allows us to pass on a ‘Spatraster’ object. XGBoost can’t handle Spatraster, so ‘terra:predict’ allows as to define a custom prediction function, which converts data into a matrix.

# gen stack from rasters bioclim_pak and elevation_pak
elev_pak = resample(elevation_pak, bioclim_pak)
ext(elevation_pak) <- ext(bioclim_pak)
prediction_rstack = c(bioclim_pak, elev_pak)

# Remove values outside pakistan, because otherwise the model will make predictions outside the modeling area
prediction_rstack = terra::mask(prediction_rstack, border_pak)
# We need to make a custom predict function for terra::predict() since xgboost didn't take a data.frame as input. See https://stackoverflow.com/questions/71947124/predict-xgboost-model-onto-raster-stack-yields-error
predict_custom <- function(model, data, ...) 
{
  stats::predict(model, newdata=as.matrix(data), ...)
}



predict <- function(model, # xgboost model as object
                    prediction_data # prediction data as spatraster stack
)
{

  #model = xgb.load(paste("out/", sp, "/" ,sp, ".model", sep = ""))
  #model = readRDS(paste("out/", sp, "/" ,sp, ".bin", sep = ""))
  prediction = terra::predict(object=prediction_data,
                       model=model,
                       fun=predict_custom
  )
  return(prediction)
}

4.4 ) Example model for species “Aglais_caschmirensis”

Now we can train and predict our first species:

sp = "Aglais_caschmirensis"

#TODO
# filter modeling data to current species, don't forget the absence points!
data <- modeling_data %>% filter(species == sp | is.na(species))

model <- train(data, nrounds = 10, save_model = FALSE)
## avg train-auc:0.991101886792453
prediction <- predict(model, prediction_rstack)

ggplot() + 
  geom_spatraster(data = prediction) +
  scale_fill_hypso_c(direction = -1,
                     limits=c(0,1),
                     name = "Prediction") +
  geom_sf(data = modeling_data %>% filter(species == sp),
          size = 1,
          shape = 1 ) +
  geom_sf(data = sf::st_as_sf(border_pak),
          fill = NA, show.legend=FALSE) +
  ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan"))
## SpatRaster resampled to ncells = 500703

Next we have on look on some metrics: 1. Cross Entropy Loss https://www.youtube.com/watch?v=Pwgpl9mKars As logloss is very high and the map

  1. Area under the Curve
ggplot(model[["evaluation_log"]]) + 
  geom_point(aes(x=iter,y=train_auc))

  1. Shap values
xgb.ggplot.shap.summary(data = as.matrix(as.data.frame(modeling_data) %>% select(-species, -presence, -geometry)), model = model )

rm(sp); rm(data); rm(model); rm(prediction)

Let’s repeat this with improved parameters: nrounds = 100

## avg train-auc:0.999064716981132
## SpatRaster resampled to ncells = 500703

Let’s repeat this wiht improved parameters: nrounds = 1000

## avg train-auc:0.999906471698113
## SpatRaster resampled to ncells = 500703

Overfitting!! Let’s repeat this wiht improved parameters: nrounds = 1000

Desrcibe Parameters

## avg train-auc:0.990339748427673
## SpatRaster resampled to ncells = 500703

4.4.1 ) Tuning Parameters

  if(plot) {
    #xgb_model[["evaluation_log"]]
    
    importance <- xgb.importance(model = model)
    #xgb.plot.importance(importance_matrix = importance)
    
    #xgb.ggplot.deepness(xgb_model)
    
    #xgb.plot.multi.trees(mode = xgb_model, features_keep = 3)
    
    #library("DiagrammeRsvg", "rsvg")
    
    
    #gr <- xgb.plot.multi.trees(model=xgb_model, features_keep = 5, render=FALSE)
    #DiagrammeR::export_graph(gr, 'tree.pdf', width=600, height=1500)
    
    xgb.ggplot.shap.summary(data = as.matrix(modeling_data %>% select(-species_occurrence, -geometry)), model = model )
    ggsave(paste(sp, "_shap.png", sep=""), path=paste("out/",sp, sep=""))
    rm(importance)
  }

4.5 ) Species Richness Map

Finally we combine all predicted species occurrence into a Map that indicates how many species might occur in one pixel. First, we define a threshold above which the prediction should be considered. The prediction have been saved as tif in ‘out/.tif’ and we need to load them before modifying. After that, we can reclassify the raster with 0 and 1, based on the threshold. To get the number of species in one pixel we need to sum up all rasters into one final raster and plot it using ggplot.

chunk_start <- proc.time()

rm(l_auc)
l_auc <- data.frame(species = character(),
                        auc = numeric())


species = (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species

# xgboost needs the output dir to exist before saving model
dir.create(path="out")

for(sp in species)
{
  loop_start <- proc.time()
  
  message("[", match(sp, species), "/", length(species), "] ", sp, ": ", appendLF=F)
  
  
  
  data <- modeling_data %>% filter(species == sp | is.na(species))

  model <- train(data, 
                 nrounds = 2000,
                 xgb_params = list(eta = 0.15, 
                   max_depth = 8, 
                   gamma = 1.5, 
                   alpha = 0, 
                   lambda = 0.7, 
                   #tree_method = "gpu_hist",
                   #predictor="gpu_predictor",
                   min_child_weight = 1),
                 save_model = FALSE)

  prediction <- predict(model, prediction_rstack)
  
  
  l_auc[nrow(l_auc) + 1,] = c(sp, as.numeric(mean(model[["evaluation_log"]][["train_auc"]])) )
  
  
  terra::writeRaster(prediction, paste("out/", sp, ".tif", sep = ""), overwrite=TRUE)
  
  ggplot() + 
    geom_spatraster(data = prediction) +
    scale_fill_hypso_c(direction = -1,
                       limits=c(0,1),
                       name = "Prediction") +
    geom_sf(data = modeling_data %>% filter(species == sp),
            size = 1,
            shape = 1 ) +
    geom_sf(data = sf::st_as_sf(border_pak),
            fill = NA, show.legend=FALSE) +
    ggtitle(paste("Oberserved and predicted occurrence of", sp, "in pakistan"))
  ggsave(paste(sp, ".png", sep=""), path="out")

  ggplot(model[["evaluation_log"]]) + 
    geom_point(aes(x=iter,y=train_auc))
  ggsave(paste(sp, "_log.png", sep=""), path="out")
  
  
  message("\nExecution took ", (proc.time() - loop_start)[3], " seconds")
}
## [1/25] Acraea_issoria: avg train-auc:0.99818205
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.5 seconds
## [2/25] Acraea_violae: avg train-auc:0.986400583333333
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.91 seconds
## [3/25] Actinor_radians: avg train-auc:0.995415083333333
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.05 seconds
## [4/25] Acytolepis_puspa: avg train-auc:0.99999525
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.59 seconds
## [5/25] Aeromachus_stigmatus: avg train-auc:0.998671027777778
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.63 seconds
## [6/25] Aglais_caschmirensis: avg train-auc:0.995068429245283
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 21.64 seconds
## [7/25] Aglais_rizana: avg train-auc:0.998809380952381
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 22.28 seconds
## [8/25] Anaphaeis_aurota: avg train-auc:0.991001024038462
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 24.86 seconds
## [9/25] Apolria_nabellica: avg train-auc:0.9968332
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.81 seconds
## [10/25] Aporia_belucha: avg train-auc:0.996009045454545
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.83 seconds
## [11/25] Aporia_soracta: avg train-auc:0.999098982142857
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 22.95 seconds
## [12/25] Appias_albina: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.36 seconds
## [13/25] Appias_libythea: avg train-auc:0.9799685
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 21.05 seconds
## [14/25] Argynnis_aglaja: avg train-auc:0.997748507575758
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.15 seconds
## [15/25] Argynnis_childreni: avg train-auc:0.997203609375
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.96 seconds
## [16/25] Argynnis_jainadeva: avg train-auc:0.99949064516129
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.26 seconds
## [17/25] Argynnis_kamala: avg train-auc:0.999459710526316
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.91 seconds
## [18/25] Argynnis_niobe: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.05 seconds
## [19/25] Argynnis_pandora: avg train-auc:0.99776975
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.37 seconds
## [20/25] Argyreus_hyperbius: avg train-auc:0.992652638157895
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.14 seconds
## [21/25] Arhopala_rama: avg train-auc:0.99853975
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 24.0600000000001 seconds
## [22/25] Arhoplala_dodonaea: avg train-auc:0.997440525
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.8299999999999 seconds
## [23/25] Ariadne_ariadne: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.33 seconds
## [24/25] Ariadne_merione: avg train-auc:0.990430384615385
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 20.48 seconds
## [25/25] Aglais_ladakensis: avg train-auc:0.5
## SpatRaster resampled to ncells = 500703
## Saving 7 x 5 in imageSaving 7 x 5 in image
## Execution took 19.9 seconds
message("Total Execution took ", (proc.time() - chunk_start)[3], " seconds")
## Total Execution took 520.99 seconds
rm(chunk_start); rm(species); rm(data); rm(model); rm(prediction)
species_nsamples = data.frame(modeling_data) %>% 
                    count(species, sort=TRUE) %>% 
                    rename(n_samples = n) %>% 
                    filter(!is.na(species))

N_auc <- merge(l_auc, species_nsamples, by="species")
N_auc$auc = as.numeric(N_auc$auc)

ggplot(N_auc, aes(x = n_samples, y = auc)) +
  geom_point() + 
  #ylim(0,1) +
  geom_smooth()  
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

#geom_smooth(aes(colour = "Interpolation"))

#TODO: overall performance: mean of logloss  
# Step 1  generate Raster Stack
# l_species will consist of all 421 species prediction rasters -> RAM usage will be insane ~ 25GB
l_species <- list()

threshold <- 0.5

# reclassify raster:
# value < threshold = 0
# value > threshold = 1
m <- c(0, threshold, 0,
       threshold, 1, 1)
m <- matrix(m, ncol=3, byrow=TRUE)

for(sp in (modeling_data %>% distinct(species) %>% filter(!is.na(species)))$species)
{
  # get species raster from file system
  r <- rast(paste("./out/",sp,".tif", sep = "" ))

  l_species[sp] <- terra::classify(r, m, include.lowest = TRUE)

  rm(r)
} 
## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated

## Warning in `[<-`(`*tmp*`, sp, value = terra::classify(r, m, include.lowest =
## TRUE)): implicit list embedding of S4 objects is deprecated
stack = terra::rast(l_species)
stack = sum(stack)
stack = terra::mask(stack, border_pak)

ggplot() + 
  geom_spatraster(data = stack) +
  scale_fill_hypso_c(palette="spain", name="N° of species" ) +
  #scale_fill_hypso_b(name="N° of species") +
  geom_sf(data = sf::st_as_sf(border_pak), fill=NA, show.legend=FALSE) +
  ggtitle("Species richness of butterflies in Pakistan")
## SpatRaster resampled to ncells = 500703

ggsave("SpeciesRichnessMap.png", path="out") 
## Saving 7 x 5 in image

For how many species is no prediction made? -> count(max(raster) = 0)

5 ) Conclusion

  • Investigate which Predictors have been used most
  • Use Automated tuning to find the best parameters

6 ) Sources

Silge, J.(2021): Use racing methods to tune xgboost models and predict home runs. https://juliasilge.com/blog/baseball-racing/ (last visited 07.07.2023)

7 ) Troubleshooting

Installing XGBoost with GPU accerlation

Pre-built binary packages are offered by XGBoost after someone makes a request on GitHub (https://github.com/dmlc/xgboost/issues/6654). Since the packages are precompiled, it should not be necessary to compile them from source. The installation still fails with error rknitr::inline_expr(“Warning in system(”sh ./configure.win”) ‘sh’ not found”)`. So there are two ways to fix this problem: - Install R-Tools and build from source - Copy src/xgboost.dll from archive (https://github.com/dmlc/xgboost/releases) into your r library manually e.g C:%USERNAME%-library\4.2

Sessioninfo

sessionInfo()
## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Germany.utf8  LC_CTYPE=German_Germany.utf8   
## [3] LC_MONETARY=German_Germany.utf8 LC_NUMERIC=C                   
## [5] LC_TIME=German_Germany.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xgboost_1.7.5.1   fastDummies_1.6.3 tidyterra_0.4.0   sf_1.0-12        
## [5] geodata_0.5-8     terra_1.7-29      ggplot2_3.4.2     dplyr_1.1.2      
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.0   xfun_0.39          bslib_0.4.2        purrr_1.0.1       
##  [5] splines_4.2.3      lattice_0.20-45    colorspace_2.1-0   vctrs_0.6.2       
##  [9] generics_0.1.3     viridisLite_0.4.2  htmltools_0.5.5    mgcv_1.8-42       
## [13] s2_1.1.3           yaml_2.3.7         utf8_1.2.3         rlang_1.1.0       
## [17] e1071_1.7-13       jquerylib_0.1.4    pillar_1.9.0       glue_1.6.2        
## [21] withr_2.5.0        DBI_1.1.3          wk_0.7.3           lifecycle_1.0.3   
## [25] munsell_0.5.0      gtable_0.3.3       ragg_1.2.5         codetools_0.2-19  
## [29] evaluate_0.21      labeling_0.4.2     knitr_1.42         fastmap_1.1.1     
## [33] class_7.3-21       fansi_1.0.4        highr_0.10         Rcpp_1.0.10       
## [37] KernSmooth_2.23-20 scales_1.2.1       classInt_0.4-9     cachem_1.0.7      
## [41] jsonlite_1.8.5     systemfonts_1.0.4  farver_2.1.1       textshaping_0.3.6 
## [45] digest_0.6.31      grid_4.2.3         cli_3.6.1          tools_4.2.3       
## [49] magrittr_2.0.3     sass_0.4.6         proxy_0.4-27       tibble_3.2.1      
## [53] tidyr_1.3.0        pkgconfig_2.0.3    Matrix_1.5-3       data.table_1.14.8 
## [57] rmarkdown_2.21     rstudioapi_0.14    R6_2.5.1           nlme_3.1-162      
## [61] units_0.8-1        compiler_4.2.3